Action Chunking
Decoupled Q-Chunking
Li, Qiyang, Park, Seohong, Levine, Sergey
Temporal-difference (TD) methods learn state and action values efficiently by bootstrapping from their own future value predictions, but this self-bootstrapping mechanism is prone to bootstrapping bias, where errors in the value targets accumulate across steps and result in biased value estimates. Recent work has proposed chunked critics, which estimate the value of short action sequences ("chunks") rather than individual actions, speeding up value backup. However, extracting policies from chunked critics is challenging: the policy must output the entire action chunk open-loop, which can be sub-optimal in environments that require policy reactivity and is also difficult to model, especially as the chunk length grows. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. We propose a novel algorithm that achieves this by optimizing the policy against a distilled critic for partial action chunks, constructed by optimistically backing up from the original chunked critic to approximate the maximum value achievable when a partial action chunk is extended to a complete one. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning action-chunking policies over long chunks. We evaluate our method on challenging, long-horizon offline goal-conditioned tasks and show that it reliably outperforms prior methods. Code: github.com/ColinQiyangLi/dqc.
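To make the optimistic backup concrete, here is a minimal sketch of how a distillation target for a partial chunk might be formed, assuming a learned chunked critic `q_chunk` over full h-step chunks and a proposal sampler `sample_completions`; both names are hypothetical, and the paper's actual distillation objective may differ.

```python
import numpy as np

def partial_chunk_target(q_chunk, sample_completions, state, partial_chunk,
                         num_samples=16):
    """Optimistic distillation target for a partial action chunk (sketch).

    Approximates max over completions of Q(s, [partial_chunk; completion])
    by sampling candidate completions and taking the best chunked-critic
    value. `partial_chunk` has k < h actions; `sample_completions` proposes
    (h - k)-step continuations, e.g. from a behavior policy.
    """
    completions = sample_completions(state, partial_chunk, num_samples)
    values = [
        q_chunk(state, np.concatenate([partial_chunk, c], axis=0))
        for c in completions
    ]
    # The sample max optimistically approximates the value of the best
    # full chunk extending this partial chunk.
    return max(values)
```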
Training-Time Action Conditioning for Efficient Real-Time Chunking
Black, Kevin, Ren, Allen Z., Equi, Michael, Levine, Sergey
Real-time chunking (RTC) enables vision-language-action models (VLAs) to generate smooth, reactive robot trajectories by asynchronously predicting action chunks and conditioning on previously committed actions via inference-time inpainting. However, this inpainting method introduces computational overhead that increases inference latency. In this work, we propose a simple alternative: simulating inference delay at training time and conditioning on action prefixes directly, eliminating any inference-time overhead. Our method requires no modifications to the model architecture or robot runtime, and can be implemented with only a few additional lines of code. In simulated experiments, we find that training-time RTC outperforms inference-time RTC at higher inference delays. In real-world experiments on box-building and espresso-making tasks with the $π_{0.6}$ VLA, we demonstrate that training-time RTC matches inference-time RTC in both task performance and speed while being computationally cheaper. Our results suggest that training-time action conditioning is a practical drop-in replacement for inference-time inpainting in real-time robot control.
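A minimal sketch of the training-time conditioning step, assuming action chunks are stored as (horizon, action_dim) arrays; the function and field names are illustrative, not the paper's code.

```python
import numpy as np

def make_prefix_conditioned_batch(action_chunks, max_delay):
    """Simulate inference delay at training time (hypothetical sketch).

    For each chunk, sample a delay d and split it into a committed prefix
    (the actions that would already be executing when inference finishes)
    and the remaining targets the model must predict.
    """
    batch = []
    for chunk in action_chunks:  # chunk: (horizon, action_dim)
        d = np.random.randint(0, max_delay + 1)
        mask = np.zeros(len(chunk), dtype=bool)
        mask[:d] = True  # actions the policy conditions on, not predicts
        batch.append({"prefix": chunk[:d], "target": chunk[d:],
                      "prefix_mask": mask})
    return batch
```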
FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization
Liu, Yicheng, Zhang, Shiduo, Dong, Zibin, Ye, Baijun, Yuan, Tianyuan, Yu, Xiaopeng, Yin, Linqi, Lu, Chenhao, Shi, Junhao, Yu, Luca Jiang-Tao, Zheng, Liangtao, Jiang, Tao, Gong, Jingjing, Qiu, Xipeng, Zhao, Hang
Figure 1: FASTer combines a learnable action tokenizer (FASTerVQ) and an autoregressive VLA model (FASTerVLA), achieving efficient compression, fast control, and strong performance across eight real and simulated embodiments.

Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.

Vision-Language-Action (VLA) models represent a paradigm shift in robotics, embodying generalist robot policies trained on increasingly large-scale robotic datasets (Chenjia Bai, 2024). These models are categorized primarily by their method of robot action prediction, with the most prominent approaches being diffusion-based (Team et al., 2024; Black et al., 2024) and autoregressive VLA (Belkhale & Sadigh, 2024; Kim et al., 2024; Pertsch et al., 2025; Zhou et al., 2025) models. While diffusion-based models have demonstrated superior precision in manipulation tasks, they often exhibit a notable deficiency in leveraging critical visual and linguistic cues (Pertsch et al., 2025; Dong et al., 2025). In contrast, recent research indicates that a carefully designed autoregressive VLA model can increasingly bridge the performance gap with its diffusion-based counterparts, while simultaneously offering enhanced instruction-following capabilities (Pertsch et al., 2025; Intelligence et al., 2025; Hancock et al., 2025), superior scene generalization (Pertsch et al., 2025), and effective transfer of common-sense knowledge (Brohan et al., 2023). Most importantly, autoregressive VLA models share the most architectural similarity to the highly successful Vision-Language Models (VLMs), suggesting significant potential for future advancements. A pivotal challenge within autoregressive VLA models is the development of an appropriate tokenization scheme to discretize continuous robot action sequences into action tokens (Wang et al., 2025c; Pertsch et al., 2025). Numerous sequence modeling studies, including LLMs and Speech-LLMs, have demonstrated that tokenizer quality directly determines model performance (Radford et al., 2019; Zhang et al., 2023; Gong et al., 2025).
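To make the tokenization bottleneck concrete, the following is a minimal sketch of nearest-neighbor vector quantization of an action chunk; FASTerVQ's actual architecture is not specified in this excerpt, and all names here are illustrative.

```python
import numpy as np

def quantize_chunk(chunk_embeddings, codebook):
    """Nearest-neighbor vector quantization of action-chunk embeddings.

    chunk_embeddings: (num_tokens, dim) continuous embeddings of a chunk.
    codebook: (codebook_size, dim) learned code vectors.
    Returns discrete token ids and their quantized embeddings.
    """
    # Squared distance between every embedding and every code vector.
    dists = ((chunk_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    token_ids = dists.argmin(axis=1)  # one discrete token per embedding
    return token_ids, codebook[token_ids]
```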
Real-Time Execution of Action Chunking Flow Policies
Black, Kevin, Galliker, Manuel Y., Levine, Sergey
Modern AI systems, especially those interacting with the physical world, increasingly require real-time performance. However, the high latency of state-of-the-art generalist models, including recent vision-language action models (VLAs), poses a significant challenge. While action chunking has enabled temporal consistency in high-frequency control tasks, it does not fully address the latency problem, leading to pauses or out-of-distribution jerky movements at chunk boundaries. This paper presents a novel inference-time algorithm that enables smooth asynchronous execution of action chunking policies. Our method, real-time chunking (RTC), is applicable to any diffusion- or flow-based VLA out of the box with no re-training. It generates the next action chunk while executing the current one, "freezing" actions guaranteed to execute and "inpainting" the rest. To test RTC, we introduce a new benchmark of 12 highly dynamic tasks in the Kinetix simulator, as well as evaluate 6 challenging real-world bimanual manipulation tasks. Results demonstrate that RTC is fast, performant, and uniquely robust to inference delay, significantly improving task throughput and enabling high success rates in precise tasks, such as lighting a match, even in the presence of significant latency. See https://pi.website/research/real_time_chunking for videos.
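The freeze-and-inpaint loop can be sketched as below with hard masking at every integration step; RTC's actual guidance-based inpainting is softer than this, and `denoise_step` is an assumed model interface.

```python
import numpy as np

def generate_with_frozen_prefix(denoise_step, committed, horizon, action_dim,
                                num_steps=10):
    """Simplified hard-inpainting sketch of real-time chunking.

    While the current chunk executes, generate the next one: at each step,
    overwrite the first len(committed) actions with those guaranteed to
    execute, so the model fills in ("inpaints") only the free suffix,
    consistently with the frozen prefix.
    """
    chunk = np.random.randn(horizon, action_dim)  # start from noise
    k = len(committed)
    for step in range(num_steps):
        chunk[:k] = committed              # freeze the committed prefix
        chunk = denoise_step(chunk, step)  # one model update of the chunk
    chunk[:k] = committed
    return chunk
```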
Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
Yang, Siyuan, Zhang, Yang, He, Haoran, Pan, Ling, Li, Xiu, Bai, Chenjia, Li, Xuelong
Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes during pre-training, and the finetuning dataset often contains demonstrations collected in kinematically suboptimal or undesirable ways, the resulting policy retains redundant action modes that are irrelevant to the successful action modes of the downstream task. Specifically, we observe a critical inference-time fragility across different sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by the stable success modes of the downstream-task dataset. Thus, we propose \textbf{TACO}, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. A VLA model integrated with TACO executes the action chunk with the maximum pseudo-count among all sampled chunks, preventing distribution shift while preserving the generalization ability of the VLA, since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL) and, being gradient-free, offers significant computational savings over RL updates, especially for flow- or diffusion-based VLAs, where the denoising process makes RL updates difficult. Extensive experiments across four simulation benchmarks (RoboTwin2.0, RoboTwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves inference stability and success rates in downstream-task adaptations.
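The verifier loop reduces to a few lines. In this sketch, `policy_sample` and `pseudo_count` are assumed interfaces standing in for the VLA sampler and the pseudo-count estimator; TACO's actual scoring model may differ.

```python
import numpy as np

def select_action_chunk(policy_sample, pseudo_count, obs, num_candidates=8):
    """Gradient-free test-time scaling with a pseudo-count verifier (sketch).

    Sample several candidate action chunks from the VLA, score each with a
    pseudo-count estimator (higher count = better supported by the stable
    success modes of the data), and execute the highest-count candidate.
    """
    candidates = [policy_sample(obs) for _ in range(num_candidates)]
    scores = [pseudo_count(obs, chunk) for chunk in candidates]
    return candidates[int(np.argmax(scores))]
```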
DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models
Zhang, Wanpeng, Wang, Ye, Luo, Hao, Yuan, Haoqi, Feng, Yicheng, Zheng, Sipeng, Jin, Qin, Lu, Zongqing
Vision-Language-Action (VLA) models trained with flow matching have demonstrated impressive capabilities on robotic manipulation tasks. However, their performance often degrades under distribution shift and on complex multi-step tasks, suggesting that the learned representations may not robustly capture task-relevant semantics. We introduce DiG-Flow, a principled framework that enhances VLA robustness through geometric regularization. Our key insight is that the distributional discrepancy between observation and action embeddings provides a meaningful geometric signal: lower transport cost indicates compatible representations, while higher cost suggests potential misalignment. DiG-Flow computes a discrepancy measure between empirical distributions of observation and action embeddings, maps it to a modulation weight via a monotone function, and applies residual updates to the observation embeddings before flow matching. Crucially, this intervention operates at the representation level without modifying the flow matching path or target vector field. We provide theoretical guarantees showing that discrepancy-guided training provably decreases the training objective, and that guided inference refinement converges with contraction. Empirically, DiG-Flow integrates into existing VLA architectures with negligible overhead and consistently improves performance, with particularly pronounced gains on complex multi-step tasks and under limited training data.
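As a rough illustration of the discrepancy-to-modulation pipeline, the sketch below uses an energy-distance stand-in for the transport-cost discrepancy; the paper's exact measure, monotone map, and residual module may all differ, and `residual_fn` is hypothetical.

```python
import numpy as np

def energy_discrepancy(x, y):
    """Energy-distance style discrepancy between two embedding sets."""
    def mean_pairwise(a, b):
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).mean()
    return 2 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)

def modulate_observations(obs_emb, act_emb, residual_fn, scale=1.0):
    """Map the discrepancy to a weight and apply a residual update to the
    observation embeddings before flow matching (names hypothetical)."""
    d = energy_discrepancy(obs_emb, act_emb)
    weight = 1.0 / (1.0 + np.exp(-scale * d))  # monotone map to (0, 1)
    # Larger discrepancy -> larger corrective residual on the observations.
    return obs_emb + weight * residual_fn(obs_emb)
```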
VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
Tang, Jiaming, Sun, Yufei, Zhao, Yilong, Yang, Shang, Lin, Yujun, Zhang, Zhuoyang, Hou, James, Lu, Yao, Liu, Zhijian, Han, Song
Vision-Language-Action models (VLAs) are becoming increasingly capable across diverse robotic tasks. However, their real-world deployment remains slow and inefficient: demonstration videos are often sped up by 5-10x to appear smooth, with noticeable action stalls and delayed reactions to environmental changes. Asynchronous inference offers a promising solution to achieve continuous and low-latency control by enabling robots to execute actions and perform inference simultaneously. However, because the robot and environment continue to evolve during inference, a temporal misalignment arises between the prediction and execution intervals. This leads to significant action instability, while existing methods either degrade accuracy or introduce runtime overhead to mitigate it. We propose VLASH, a general asynchronous inference framework for VLAs that delivers smooth, accurate, and fast reaction control without additional overhead or architectural changes. VLASH estimates the future execution-time state by rolling the robot state forward with the previously generated action chunk, thereby bridging the gap between prediction and execution. Experiments show that VLASH achieves up to 2.03x speedup and reduces reaction latency by up to 17.4x compared to synchronous inference while fully preserving the original accuracy. Moreover, it empowers VLAs to handle fast-reaction, high-precision tasks such as playing ping-pong and playing whack-a-mole, where traditional synchronous inference fails. Code is available at https://github.com/mit-han-lab/vlash
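The core roll-forward is short. In this sketch, `dynamics_step` is an assumed forward model (for proprioceptive state this can be as simple as assuming commanded targets are tracked); the interface is illustrative rather than VLASH's actual implementation.

```python
def predict_execution_state(state, prev_chunk, executed_steps, delay_steps,
                            dynamics_step):
    """Estimate the state at execution time by rolling forward (sketch).

    `prev_chunk` is the previously generated action chunk, of which
    `executed_steps` have already run; the next `delay_steps` actions will
    execute while inference runs, so condition the model on the state
    predicted after them rather than on the stale current state.
    """
    future = state
    for t in range(executed_steps, executed_steps + delay_steps):
        future = dynamics_step(future, prev_chunk[t])
    return future
```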
SENTINEL: A Fully End-to-End Language-Action Model for Humanoid Whole Body Control
Wang, Yuxuan, Jiang, Haobin, Yao, Shiqing, Ding, Ziluo, Lu, Zongqing
Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation with a pretrained whole-body controller and pairing them with text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. It generates action chunks using flow matching, which can subsequently be refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into text.
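A minimal sketch of the two-stage generation described above: Euler integration of an assumed `velocity_field` from noise to an action chunk, followed by an assumed `residual_head`; both are hypothetical interfaces, not SENTINEL's actual modules.

```python
import numpy as np

def generate_chunk(velocity_field, residual_head, obs, horizon, action_dim,
                   num_steps=10):
    """Flow-matching chunk generation with residual refinement (sketch)."""
    x = np.random.randn(horizon, action_dim)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_field(x, t, obs)  # one Euler step on the flow
    return x + residual_head(x, obs)  # residual correction for deployment
```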
Mixture of Horizons in Action Chunking
Jing, Dong, Wang, Gang, Liu, Jiaqi, Tang, Weiliang, Sun, Zelong, Yao, Yunchao, Wei, Zhenyu, Liu, Yunhui, Lu, Zhiwu, Ding, Mingyu
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed the $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that a fixed choice of a single horizon is suboptimal. To mitigate this trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability on complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments over the flow-based policies $π_0$ and $π_{0.5}$ and the one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulated and real-world tasks. Notably, under the mixed-task setting, $π_{0.5}$ with MoH reaches a new state-of-the-art with a 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
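A simplified sketch of the gating idea, assuming a shared `predict(obs, h)` head that returns a same-shape action prediction for each training horizon h; MoH's actual segment rearrangement inside the action transformer is more involved.

```python
import numpy as np

def mixture_of_horizons(predict, gate_logits, obs, horizons=(16, 8, 4)):
    """Fuse per-horizon action predictions with a linear gate (sketch).

    Long horizons contribute global foresight, short ones local precision;
    the softmax gate mixes their predictions into one action chunk.
    """
    preds = np.stack([predict(obs, h) for h in horizons])  # (H, T, dim)
    w = np.exp(gate_logits) / np.exp(gate_logits).sum()    # softmax weights
    return (w[:, None, None] * preds).sum(axis=0)
```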
Improving Pre-Trained Vision-Language-Action Policies with Model-Based Search
Neary, Cyrus, Younis, Omar G., Kuramshin, Artur, Aslan, Ozgur, Berseth, Glen
Pre-trained vision-language-action (VLA) models offer a promising foundation for generalist robot policies, but often produce brittle behaviors or unsafe failures when deployed zero-shot in out-of-distribution scenarios. We present Vision-Language-Action Planning & Search (VLAPS) -- a novel framework and accompanying algorithms that embed model-based search into the inference procedure of pre-trained VLA policies to improve their performance on robotic tasks. Specifically, our method biases a modified Monte Carlo Tree Search (MCTS) algorithm -- run using a model of the target environment -- using action priors defined by the VLA policy. By using VLA-derived abstractions and priors in model-based search, VLAPS efficiently explores language-conditioned robotics tasks whose search spaces would otherwise be intractably large. Conversely, by integrating model-based search with the VLA policy's inference procedure, VLAPS yields behaviors that are more performant than those obtained by directly following the VLA policy's action predictions. VLAPS offers a principled framework to: i) control test-time compute in VLA models, ii) leverage a priori knowledge of the robotic environment, and iii) integrate established planning and reinforcement learning techniques into the VLA inference process. Across all experiments, VLAPS significantly outperforms VLA-only baselines on language-specified tasks that would otherwise be intractable for uninformed search algorithms, increasing success rates by as much as 67 percentage points.
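A PUCT-style selection rule biased by policy priors captures the core of the search bias; the node fields and names here are illustrative, not VLAPS's exact algorithm.

```python
import math

def puct_select(children, vla_prior, c_puct=1.5):
    """Select a child action with prior-biased UCT (sketch).

    `children` maps action -> {"n": visit count, "q": mean value};
    `vla_prior` maps action -> probability under the pre-trained VLA.
    The prior steers search toward actions the VLA deems plausible, while
    the value and visit terms let model-based search correct it.
    """
    total_n = sum(c["n"] for c in children.values())

    def score(action):
        c = children[action]
        explore = c_puct * vla_prior[action] * math.sqrt(total_n + 1) / (1 + c["n"])
        return c["q"] + explore

    return max(children, key=score)
```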